feat: estimate cycles by not-matthias · Pull Request #17 · CodSpeedHQ/valgrind-codspeed

not-matthias · 2026-06-09T16:26:27Z

No description provided.

codspeed-hq · 2026-06-09T18:17:34Z

Merging this PR will improve performance by ×2.3

⚡ 9 improved benchmarks
❌ 2 regressed benchmarks
✅ 29 untouched benchmarks
⏩ 80 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Benchmark	`BASE`	`HEAD`	Efficiency
❌	`test_valgrind[valgrind-3.26.0, python3 testdata/test.py, full-no-inline]`	6.8 s	16.2 s	-57.94%
❌	`test_valgrind[valgrind-3.25.1, python3 testdata/test.py, full-with-inline]`	6.9 s	8.7 s	-20.69%
⚡	`test_valgrind[valgrind-3.25.1, echo Hello, World!, full-with-inline]`	612,627.1 ms	734.8 ms	×830
⚡	`test_valgrind[valgrind-3.25.1, stress-ng --cpu 4 --cpu-ops 10, full-no-inline]`	5.3 s	3.2 s	+65.08%
⚡	`test_valgrind[valgrind-3.26.0, stress-ng --cpu 4 --cpu-ops 10, full-no-inline]`	5.3 s	3.2 s	+64.34%
⚡	`test_valgrind[valgrind-3.25.1, stress-ng --cpu 4 --cpu-ops 10, full-with-inline]`	5.5 s	3.4 s	+61.54%
⚡	`test_valgrind[valgrind-3.26.0, stress-ng --cpu 4 --cpu-ops 10, full-with-inline]`	5.6 s	3.5 s	+61.15%
⚡	`test_valgrind[valgrind-3.26.0, stress-ng --cpu 4 --cpu-ops 10, no-inline]`	3.1 s	2 s	+51.99%
⚡	`test_valgrind[valgrind-3.25.1, stress-ng --cpu 4 --cpu-ops 10, no-inline]`	3.1 s	2 s	+51.31%
⚡	`test_valgrind[valgrind-3.25.1, stress-ng --cpu 4 --cpu-ops 10, inline]`	3.3 s	2.2 s	+47.3%
⚡	`test_valgrind[valgrind-3.26.0, stress-ng --cpu 4 --cpu-ops 10, inline]`	3.3 s	2.2 s	+46.97%
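
The Efficiency column appears consistent with `BASE / HEAD - 1` computed on unrounded timings, with very large gains shown as a multiplier instead. This is inferred from the figures in the table, not from CodSpeed's documented formula. The helper names below are hypothetical. A minimal sketch of the arithmetic:

```python
def efficiency_pct(base: float, head: float) -> float:
    """Relative efficiency change in percent (assumed formula: base/head - 1)."""
    return (base / head - 1.0) * 100.0


def speedup_factor(base: float, head: float) -> float:
    """Plain ratio, the form the table uses for very large improvements."""
    return base / head


# Row: python3 testdata/test.py, valgrind-3.25.1, full-with-inline (6.9 s -> 8.7 s)
print(round(efficiency_pct(6.9, 8.7), 2))

# Row: echo Hello, World! (612,627.1 ms -> 734.8 ms), reported as x830
print(round(speedup_factor(612_627.1, 734.8)))
```

Rows whose displayed timings are rounded to one decimal (for example 3.3 s to 2.2 s) will not reproduce the reported percentage exactly from the displayed values.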

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing feat/cycle-estimation (2bc9a1c) with master (fa9ee2e)

¹ 80 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

greptile-apps · 2026-06-12T06:55:57Z

Greptile Summary

This PR adds per-instruction cycle estimation to Callgrind by integrating Capstone for real-time instruction decoding and generated LUT files (x86_caps_lut.inc, arm64_caps_lut.inc) that map packed instruction signatures to throughput and latency centi-cycle costs (Ct/Cl). It is enabled via a new --cycle-estimation=yes flag and wires the costs into the existing event-group framework alongside the cache and branch simulators.

New decoder pipeline: cycledecode_capstone.c opens a Capstone handle at post_clo_init time with all libc calls forwarded to Valgrind's freestanding coregrind libc; cycledecode.c performs a two-level LUT lookup (exact signature → width-agnostic → per-instruction default → 1.00-cycle flat fallback) and accumulates per-BB running sums for O(1) inclusive-cost updates at side exits.
Build system: Capstone is now a mandatory build dependency detected by configure.ac; all CI, release, and debian/rules paths switch to --enable-only64bit to avoid compiling the 32-bit secondary target without Capstone. A new composite GitHub Action builds the static Capstone and exports CAPSTONE_DIR.
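
The lookup fallback chain described in the summary can be sketched roughly as follows. This is an illustration, not the code in cycledecode.c: the row layout, the `WIDTH_MASK` bit position, the table contents, and the assumption that the flat fallback applies to both Ct and Cl are all hypothetical. Only the order of the fallbacks and the 100 centi-cycle value come from the summary.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional, Tuple

FALLBACK_CENTI_CYCLES = 100  # flat 1.00 cycle

# Hypothetical layout: the low 8 bits of a signature carry operand widths.
WIDTH_MASK = 0xFF


class Row(NamedTuple):
    insn_id: int  # instruction identifier (made-up values below)
    sig: int      # packed operand signature; 0 marks the per-instruction default
    ct: int       # reciprocal throughput in centi-cycles
    cl: int       # latency in centi-cycles


# The table must be sorted by (insn_id, sig) for the binary search to work.
LUT = sorted([
    Row(10, 0x0000, 100, 100),   # default row for instruction 10
    Row(10, 0x1100, 25, 100),    # width-agnostic row
    Row(10, 0x1108, 50, 300),    # exact row
    Row(20, 0x2200, 400, 1200),  # instruction 20 has no default row
])
KEYS = [(r.insn_id, r.sig) for r in LUT]


def row_for(insn_id: int, sig: int) -> Optional[Row]:
    """Binary-search the table for an exact (insn_id, sig) key."""
    i = bisect_left(KEYS, (insn_id, sig))
    if i < len(KEYS) and KEYS[i] == (insn_id, sig):
        return LUT[i]
    return None


def cycle_cost(insn_id: int, sig: int) -> Tuple[int, int]:
    """Try exact, then width-agnostic, then default; else the flat fallback."""
    for candidate in (sig, sig & ~WIDTH_MASK, 0):
        row = row_for(insn_id, candidate)
        if row is not None:
            return row.ct, row.cl
    return FALLBACK_CENTI_CYCLES, FALLBACK_CENTI_CYCLES
```

With this toy table, `cycle_cost(10, 0x1104)` misses the exact row and resolves through the width-agnostic row, while `cycle_cost(20, 0x9901)` misses every level and gets the flat fallback.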

Confidence Score: 5/5

Safe to merge; both findings are non-blocking edge cases that do not affect the common execution path.

The core decode-and-lookup pipeline, cost accumulation in cachesim_add_icost and setup_bbcc, and the build system wiring are all correct. Both findings are narrow edge cases that do not affect normal amd64/arm64 operation.

callgrind/main.c (zero-length IMark handling) and callgrind/cycledecode_capstone.h (i386 mode guard)

Important Files Changed

Filename	Overview
callgrind/main.c	Adds cycle cost decode at IMark time and running-sum compute at BB finalisation; contains a redundant ternary for `len` that passes `VG_MIN_INSTR_SZB` instead of 0 to Capstone when VEX reports an undecodable instruction.
callgrind/cycledecode_capstone.h	New file: arch selection + Capstone handle API; `CS_MODE_64` is used for both x86_64 and i386 hosts, which would silently decode 32-bit instructions incorrectly on an i386 build.
callgrind/cycledecode.c	New file: per-instruction cycle cost lookup using Capstone + binary-searched LUT; logic is correct, with two-level fallback (width-agnostic retry, then per-instruction default).
callgrind/cycledecode_capstone.c	New file: Capstone bridge + libc shims for nodefaultlibs Valgrind tool; shims follow coregrind conventions.
callgrind/sim.c	Registers EG_CYCLES event group and wires per-instruction Ct/Cl cost accumulation into cachesim_add_icost; follows the existing conditional-register / unconditional-add-to-full pattern for EG_ALLOC and EG_SYS.
callgrind/bbcc.c	Adds inclusive cycle cost (ct_incl/cl_incl) updates at side-exit handling for both skipped and non-skipped paths; guarded by cycle_estimation flag.
callgrind/global.h	Adds cycle_estimation CLI flag, four UInt cycle-cost fields to InstrInfo, and the EG_CYCLES=9 event group constant.
callgrind/Makefile.am	Adds cycledecode.c and cycledecode_capstone.c to CALLGRIND_SOURCES_COMMON and CAPSTONE_CFLAGS/LIBS to the primary target only; secondary target is avoided by --enable-only64bit in all build paths.
configure.ac	Adds mandatory Capstone detection (--with-capstone or $CAPSTONE_DIR); errors clearly if missing.
bench/generate_config.py	Extends CONFIGS with requires_codspeed flag; adds CODSPEED_VERSION constant and should_skip guard so --cycle-estimation configs are omitted for upstream Valgrind builds.
.github/actions/build-capstone/action.yml	New composite action: builds a static Capstone (x86+arm64 only, no stack-protector/fortify) and exports CAPSTONE_DIR to subsequent steps.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Ist_IMark during CLG_instrument] -->|cycle_estimation=yes, !seen_before| B[clg_cycle_cost bytes from cia len]
    B --> C{cs_disasm_iter Capstone}
    C -->|decode ok| D[compute_sig arch-specific]
    D --> E{row_for exact sig}
    E -->|hit| H[ct_cost = row->cy cl_cost = row->cl]
    E -->|miss| F{row_for width-agnostic}
    F -->|hit| H
    F -->|miss| G{row_for sig==0 default}
    G -->|hit| H
    G -->|miss| I[fallback: 100 centi-cycles]
    C -->|decode fail| I
    H --> J[curr_inode->ct_cost / cl_cost]
    I --> J
    J --> K[BB finalise: compute ct_incl/cl_incl running sums]
    K --> L[Runtime: cachesim_add_icost cost += exe_count x ct_cost]
    K --> M[setup_bbcc side exit cost += ct_incl / cl_incl]


Reviews (10): Last reviewed commit: "wip: use latest runner to fix samply"

GuillaumeLagrange

olgtm, do we have internal documentation on how we generated the LUT?

+ for curiosity: why do we need capstone? It could be made a bit clearer.
My understanding is that it's used to transform the instruction's operation to derive the ID for the LUT?

not-matthias · 2026-06-19T12:06:48Z

> olgtm, do we have internal documentation on how we generated the LUT?

Yes, see the AvalancheHQ/valgrind-helpers repository.

> for curiosity: why do we need capstone? It could be made a bit clearer. My understanding is that it's used to transform the instruction's operation to derive the ID for the LUT?

We need to build a LUT entry for each instruction, but we can't just key it on the raw bytes, as some instructions have 64-bit immediate params. So we have to extract the parts that identify an instruction. Intel's XED decoder has IFORM, which would be super helpful here, as it identifies each instruction by its category. For example, XED_IFORM_MOV_MEMb_GPR8_DEFINED is mov [reg], reg.

We can't use XED because it only supports x86_64 and not ARM, which is why we manually reconstruct something similar with Capstone.
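
To make the raw-bytes problem concrete, here is a small sketch of the idea. It does not call Capstone: the `Insn` and `Operand` types stand in for a decoded instruction, and the 8-bits-per-operand packing is a made-up layout rather than the one used in the LUT. The two `mov rax, imm64` encodings (`48 B8` followed by the little-endian immediate) differ in their bytes but share one signature, because the immediate value is deliberately left out.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical encodings for operand kind and width.
KIND = {"reg": 1, "mem": 2, "imm": 3}
WIDTH_CODE = {8: 0, 16: 1, 32: 2, 64: 3}


@dataclass(frozen=True)
class Operand:
    kind: str        # "reg", "mem" or "imm"
    width_bits: int  # operand width
    value: int = 0   # register number or immediate; ignored by the signature


@dataclass(frozen=True)
class Insn:
    mnemonic: str
    raw: bytes
    operands: Tuple[Operand, ...]


def compute_sig(insn: Insn) -> Tuple[str, int]:
    """Pack operand kinds and widths, but never operand values."""
    sig = 0
    for op in insn.operands:
        sig = (sig << 8) | (KIND[op.kind] << 4) | WIDTH_CODE[op.width_bits]
    return insn.mnemonic, sig


# mov rax, 0x1122334455667788
big = Insn("mov", bytes.fromhex("48b88877665544332211"),
           (Operand("reg", 64), Operand("imm", 64, 0x1122334455667788)))
# mov rax, 1 (same imm64 encoding)
small = Insn("mov", bytes.fromhex("48b80100000000000000"),
             (Operand("reg", 64), Operand("imm", 64, 1)))
# mov [reg], reg with 8-bit operands
store8 = Insn("mov", bytes.fromhex("8808"),
              (Operand("mem", 8), Operand("reg", 8, 1)))

print(big.raw != small.raw, compute_sig(big) == compute_sig(small))
print(compute_sig(big) != compute_sig(store8))
```

Keying the table on the raw bytes would need one row per immediate value; keying on the signature needs one row per operand shape.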

Add the regenerated x86_caps_lut.inc / arm64_caps_lut.inc cost tables consumed by the --cycle-estimation runtime.
- x86: Zen4-tuned reciprocal-throughput table.
- arm64: measured Cortex-A72 table, with a hand-frozen guide supplement for ops that are not benchmarked.

…-bit Capstone

The amd64 host builds both the primary (amd64) tool and a 32-bit x86 secondary tool, but Capstone is only built 64-bit, so CLG_WITH_CAPSTONE is set only for the primary build. The secondary build compiled cycledecode.c without it and tripped the mandatory-Capstone #error.

CodSpeed only ever runs the 64-bit tool, so build 64-bit only everywhere: add --enable-only64bit to the CI configure, the release deb (debian/rules, now unconditional), and the Justfile, and drop the now-unneeded gcc-multilib / libc6-dev-i386 deps. This also roughly halves build time by skipping the entire 32-bit toolchain.

Callgrind's cycle estimation links a static Capstone decoder. Add a build step to ci, codspeed and release workflows that compiles Capstone 5.0.9 x86+arm64 only (other printers reference libc symbols the -nodefaultlibs tool does not shim) and without stack-protector/fortify (the tool runs without glibc's %fs TLS), then exports its prefix as CAPSTONE_DIR for configure to pick up. Add cmake to the apt deps and forward CAPSTONE_DIR through debuild -e in the release build.

… estimation

configure.ac gains --with-capstone=PATH (defaulting to $CAPSTONE_DIR) and makes a static Capstone mandatory for the native tool, compiling the decoder with fortify disabled since it links -nodefaultlibs. Makefile.am adds the cycledecode sources/headers, ships the LUT .inc tables, and passes the Capstone CFLAGS/LIBS. debian/rules forwards CAPSTONE_DIR to configure via --with-capstone.

Decode the real guest bytes of each instruction (via Capstone) at first translation and look up reciprocal-throughput (Ct) and latency (Cl) estimates in the cost table. Register an EG_CYCLES event group exposing Ct/Cl, accumulate self cost in the cache simulator and running inclusive sums per BB so the call-graph cost at each side exit is an O(1) lookup. Falls back to a flat 1.00 cycle (with a warning) on decode failure or no table match, and disables itself if Capstone is unavailable for the guest.
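
The running inclusive sums mentioned in this commit message amount to a prefix sum over the per-instruction costs of a basic block. The sketch below shows the idea only; the function names are hypothetical, and whether the real sum includes the instruction at the side exit is an assumption here.

```python
from itertools import accumulate
from typing import List


def finalise_bb(ct_costs: List[int]) -> List[int]:
    """Compute ct_incl once per BB: ct_incl[i] covers instructions 0..i."""
    return list(accumulate(ct_costs))


def cost_at_side_exit(ct_incl: List[int], exit_index: int) -> int:
    """O(1): cost of the part of the BB executed before leaving at exit_index."""
    return ct_incl[exit_index]


# Per-instruction Ct costs in centi-cycles for a four-instruction block.
ct_incl = finalise_bb([25, 100, 50, 400])
print(ct_incl)                        # [25, 125, 175, 575]
print(cost_at_side_exit(ct_incl, 2))  # 175
```

The sums are computed once when the block is finalised, so a side exit does not have to re-walk the instructions that ran before it.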

not-matthias force-pushed the feat/cycle-estimation branch from 4b2e2b7 to 22e0934 Compare June 11, 2026 17:56

not-matthias marked this pull request as ready for review June 12, 2026 06:49

greptile-apps Bot reviewed Jun 12, 2026

Comment thread callgrind/cycledecode.c Outdated

Comment thread callgrind/main.c

Comment thread bench/generate_config.py Outdated

greptile-apps Bot reviewed Jun 12, 2026

Comment thread callgrind/cycledecode.c Outdated

not-matthias requested a review from GuillaumeLagrange June 15, 2026 07:57

not-matthias force-pushed the feat/cycle-estimation branch 3 times, most recently from 650e97b to 86ac213 Compare June 18, 2026 13:17

GuillaumeLagrange approved these changes Jun 18, 2026

Comment thread .github/workflows/ci.yml Outdated

Comment thread .github/workflows/codspeed.yml Outdated

not-matthias added 8 commits June 19, 2026 14:09

chore: add cycle estimation benchmarks

2546b0b

chore: add nix dev-shell flake for valgrind-codspeed

4fdf103

wip: use latest runner to fix samply

2bc9a1c

not-matthias force-pushed the feat/cycle-estimation branch from 1a36a0a to 2bc9a1c Compare June 19, 2026 12:10

not-matthias wants to merge 8 commits into master from feat/cycle-estimation